Abstract

Bayesian model averaging (BMA) is a common approach to average over alternative
models; yet, it usually gets excessively concentrated around the single most probable
model, therefore achieving only sub-optimal classification performance. The
compression-based approach (Boullé, 2007) overcomes this problem; it averages over
the different models by applying a logarithmic smoothing over the models' posterior
probabilities. This approach has shown excellent performance when applied to
ensembles of naive Bayes classifiers. AODE is another ensemble of models with high
performance (Webb et al., 2005): it consists of a collection of non-naive classifiers
(called SPODEs) whose probabilistic predictions are aggregated by arithmetic
mean. Aggregating the SPODEs via BMA rather than by arithmetic mean deteriorates
the performance; instead, we propose to aggregate the SPODEs via the compression
coefficients and we show that the resulting classifier obtains a slight but consistent
improvement over AODE. However, an important issue in any Bayesian ensemble of
models is the arbitrariness in the choice of the prior over the models. We address this
problem by adopting the paradigm of credal classification, namely by substituting the
unique prior with a set of priors. Credal classifiers are able to automatically recognize
the prior-dependent instances, namely the instances whose most probable class varies
when different priors are considered; in these cases, credal classifiers remain reliable by
returning a set of classes rather than a single class. We thus develop the credal version
of both the BMA-based and the compression-based ensemble of SPODEs, substituting
the single prior over the models by a set of priors. By experiments we show that both
credal classifiers provide overall higher classification reliability than their determinate
counterparts. Moreover, the compression-based credal classifier compares favorably to
previous credal classifiers.

Keywords: classification, Bayesian Model Averaging, compression coefficients, 
AODE, credal classification, imprecise probability 



1. Introduction 



Bayesian model averaging (BMA) (Hoeting et al., 1999) is a sound solution to the
uncertainty which characterizes the identification of the supposedly best model for a
certain data set; given a set of alternative models, BMA weights the inferences produced
by the various models, using the models' posterior probabilities as weights.
BMA assumes the data to be generated by one of the considered models; under this
assumption, it provides better predictive accuracy than any single model (Hoeting et al.,
1999). However, such an assumption is generally not true; for this reason, on real data
sets BMA does not generally perform very well; see the discussion and the references
in Cerquides et al. (2005) for more details. The problem is that BMA gets excessively
concentrated around the single most probable model (Domingos, 2000; Minka, 2002):
especially on large data sets, "averaging using the posterior probabilities to weight
the models is almost the same as selecting the MAP model" (Boullé, 2007). To overcome
the problem of BMA getting excessively concentrated around the most probable
model, a compression-based approach has been introduced in (Boullé, 2007); it computes
more evenly-distributed weights, by applying a logarithmic smoothing to the
models' posterior probabilities. The compression-based weights, which can be justified
from an information-theoretic viewpoint, have been used in Boullé (2007) to average
over different naive Bayes classifiers, characterized by different feature sets, obtaining
excellent rank in international competitions on classification.

Another ensemble of Bayesian network classifiers known for its good performance
is AODE (Webb et al., 2005), which is instead based on a set of SPODE
(SuperParent-One-Dependence Estimator) models. Each SPODE adopts a certain feature as a
super-parent, namely it models all the remaining features as depending on both the class and
the super-parent. AODE then simply averages the posterior probabilities computed by
the different SPODEs. Alternative methods to aggregate SPODEs, more complex than
AODE, have been considered (Yang et al., 2007), but AODE generally outperforms
them: "AODE, which simply linearly combines every SPODE without any selection
or weighting, is actually more effective than the majority of rival schemes". As
reported in (Cerquides et al., 2005; Yang et al., 2007), AODE outperforms aggregating
SPODEs via BMA; in both Yang et al. (2007) and Cerquides et al. (2005) the best results
were instead obtained using an algorithm (called MAPLMG), which estimates the most
probable linear mixture of SPODEs; this overcomes the problem of assuming a single
SPODE to be the true model. In this paper, we address this problem by means of the
compression coefficients.

As a preliminary step we develop BMA-AODE, namely BMA over SPODEs, with
some computational differences with respect to the framework of Yang et al. (2007)
and Cerquides et al. (2005); our results confirm however that BMA over SPODEs is
outperformed by AODE. Then we develop the novel COMP-AODE classifier, which
weights the SPODEs using the compression-based coefficients, and we show that it
yields a slight but consistent improvement in the classification performance over the
standard AODE. Considering the high performance of AODE, we regard this result as
noteworthy.

An important issue in any Bayesian ensemble is choosing the prior over the models.
A common choice is to adopt a uniform mass function, as we do in both BMA-AODE
and COMP-AODE; this however can be criticized from different standpoints; see for
instance the rejoinder in Hoeting et al. (1999). In Boullé (2007), a prior which favors
simpler models over complex ones is adopted. Although all these choices are reasonable,
the specification of any single prior implies some arbitrariness, which entails the
risk of prior-dependent, and hence potentially fragile, conclusions.

In fact, the specification of the prior over the models is a serious open problem for
Bayesian ensembles of models. We address this problem by adopting the paradigm of
credal classification (Corani et al., 2012; Corani and Zaffalon, 2008b), namely dropping
the unique prior in favor of a set of priors (prior credal set) (Levi, 1980). While
a traditional non-informative prior represents a condition of indifference between the
alternative models, a credal set describes a condition of prior ignorance, thus letting
the prior probability of each model vary over a wide interval, instead of fixing it to a
specific number. Credal classifiers are able to automatically detect the instances whose
most probable class varies when different priors are considered; such instances are
called prior-dependent. Credal classifiers remain reliable on prior-dependent instances
by returning a set of classes; traditional classifiers have instead typically low accuracy
on the prior-dependent instances (Corani and Zaffalon, 2008a,b).



We then develop BMA-AODE* and COMP-AODE*, namely the credal counterparts
of respectively BMA-AODE and COMP-AODE. By extensive experiments we
show that both credal classifiers are sensible extensions of their single-prior counterparts;
in fact, they return a small-sized but highly accurate set of classes on the
prior-dependent instances, over which instead their single-prior counterparts have reduced
accuracy. We conclude by showing that COMP-AODE* compares favorably to both
BMA-AODE* and other existing credal classifiers.

2. Methods 

We consider a classification problem with $k$ features; we denote by $C$ the class
variable (taking values in $\mathcal{C}$) and by $\mathbf{A} := (A_1, \dots, A_k)$ the set of features, taking
values respectively in $\mathcal{A}_1, \dots, \mathcal{A}_k$. For a generic variable $A$, we denote as $P(A)$ the
probability mass function over its values and as $P(a)$ the probability that $A = a$. We
assume the data to be complete and the training data $\mathcal{D}$ to contain $n$ instances. We learn
the model parameters from the training data by adopting Dirichlet priors and setting the
equivalent sample size to 1. Under 0-1 loss a traditional probabilistic classifier returns,
for a test instance $\mathbf{a} = \{a_1, \dots, a_k\}$ whose class is unknown, the most probable class
$c^*$:

$$c^* := \arg\max_{c \in \mathcal{C}} P(c|\mathbf{a}).$$

Classifiers based on imprecise probabilities (credal classifiers) change this paradigm,
by occasionally returning more classes; this happens in particular when the most probable
class is prior-dependent. We discuss this point in more detail later, when presenting
credal classifiers.

2.1. From Naive Bayes to AODE 

The Naive Bayes classifier assumes the stochastic independence of the features
given the class; it therefore factorizes the joint probability as follows:

$$P(c, \mathbf{a}) := P(c) \cdot \prod_{j=1}^{k} P(a_j|c), \tag{1}$$

corresponding to the topology of Fig. 1(a). Despite the biased estimate of probabilities
due to the above (so-called naive) assumption, naive Bayes performs well under 0-1
loss (Domingos and Pazzani, 1997): it thus constitutes a reasonable choice if the
goal is simple classification, without the need for accurate probability estimates; it is
especially competitive on data sets of small and medium size, thanks to its low variance
error (Friedman, 1997).

To improve the model, weaker assumptions about the conditional independence of
the features have to be considered; for instance, the tree-augmented naive classifier
(TAN) allows each feature to depend on the class and on possibly another single feature,
constraining however the subgraph involving only the features to be a tree; an
example is shown in Fig. 1(b). Generally, TAN outperforms naive Bayes in classification
(Friedman et al., 1997).





(a) Naive Bayes. (b) A possible TAN structure. 

Figure 1: Naive Bayes vs TAN. 



The AODE classifier (Webb et al., 2005) is an ensemble of $k$ SPODE (SuperParent
One Dependence Estimator) classifiers; each SPODE is characterized by a certain
super-parent feature, so that the other features are modeled as depending on both the
class and the super-parent, as shown in Fig. 2. In fact, each single SPODE is a TAN.




Figure 2: SPODE with super-parent Ai. 

We denote the set of SPODEs as $\mathcal{S} := \{s_1, \dots, s_k\}$, where $s_j$ indicates the SPODE
with super-parent $A_j$. The joint probability of SPODE $s_j$ factorizes as:

$$P(c, \mathbf{a}|s_j) = P(c) \cdot P(a_j|c) \cdot \prod_{i=1,\dots,k,\ i \neq j} P(a_i|a_j, c).$$

In order to classify the test instance $\mathbf{a}$, AODE averages the posterior probability $P(c|\mathbf{a}, s_j)$
computed by each single SPODE:

$$P(c|\mathbf{a}) = \frac{1}{k} \sum_{j=1}^{k} P(c|\mathbf{a}, s_j).$$

In this paper we focus on more sophisticated approaches for aggregating the predictions
of the SPODEs.
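As a minimal illustration of the AODE aggregation step, the following Python sketch averages per-SPODE class posteriors. The function name `aode_posterior` and the numeric posteriors are hypothetical; in practice each row would be produced by a SPODE learned from data.

```python
def aode_posterior(spode_posteriors):
    """Arithmetic mean of the class posteriors returned by the SPODEs.

    spode_posteriors[j][c] is P(c | a, s_j) for SPODE j and class index c.
    """
    k = len(spode_posteriors)
    n_classes = len(spode_posteriors[0])
    return [sum(p[c] for p in spode_posteriors) / k for c in range(n_classes)]


# Hypothetical posteriors from three SPODEs over two classes.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
avg = aode_posterior(posteriors)
# Under 0-1 loss the classifier returns the most probable class.
predicted = max(range(len(avg)), key=lambda c: avg[c])
```

Every SPODE receives the same weight here, which is exactly what the weighting schemes discussed next are meant to refine.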

2.2. Bayesian Model Averaging (BMA) with SPODEs 

BMA assumes that one of the models in the ensemble is the true one. Under this
assumption, the optimal strategy is to weight the inferences produced by the models
of the ensemble using as weights the models' posterior probabilities. By applying
BMA on top of different SPODEs, we thus assume one of the SPODEs to be the true
model. We thus introduce a variable $S$ over $\mathcal{S}$, where $P(S = s_j)$ denotes the prior
probability of SPODE $s_j$ to be the true model. Considering that every SPODE has
the same number of variables, the same number of arcs and the same in-degree¹, we
adopt a uniform prior, thus assigning prior probability $1/k$ to each SPODE. In fact, the
uniform prior over the models is frequently adopted within BMA. To classify the test
instance $\mathbf{a}$, BMA computes the following posterior mass function:

$$P(c|\mathbf{a}, \mathcal{D}) = \sum_{j=1}^{k} P(c|\mathbf{a}, s_j) \cdot P(s_j|\mathcal{D}) \propto \sum_{j=1}^{k} P(c|\mathbf{a}, s_j) \cdot P(\mathcal{D}|s_j) \cdot P(s_j),$$

where $P(\mathcal{D}|s_j)$ is the marginal likelihood of $s_j$, namely

$$P(\mathcal{D}|s_j) = \int P(\mathcal{D}|s_j, \theta_j) \, P(\theta_j|s_j) \, d\theta_j,$$

with $\theta_j$ denoting the parameters of SPODE $s_j$. This computational schema has been
adopted to implement BMA over SPODEs in (Cerquides et al., 2005; Yang et al., 2007),
and has been outperformed by AODE.

¹The in-degree is the maximum number of parents per node: it is two for any SPODE.

The marginal likelihood measures how good the model is at representing the joint
distribution; yet, a classifier has instead to estimate the posterior probability of the
classes conditionally on the features. Therefore, a model can perform badly at classification
despite having high marginal likelihood (Cowell, 2001; Kontkanen et al., 1999);
for this reason, scoring rules more appropriate for classification should be considered.
Following Boullé (2007), we thus substitute the marginal likelihood with conditional
likelihood:

$$L_j := \prod_{i=1}^{n} P(c^{(i)}|\mathbf{a}^{(i)}, s_j, \hat{\theta}_j), \tag{2}$$

where $P(c^{(i)}|\mathbf{a}^{(i)}, s_j, \hat{\theta}_j)$ denotes the probability assigned by model $s_j$ to the true class
of the $i$-th instance, and $\hat{\theta}_j$ is the estimate of the parameters of model $s_j$.
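To make the conditional likelihood concrete, the sketch below accumulates it in log space from the probabilities that a model assigns to the true class of each training instance. The function name and the toy probabilities are assumptions made for illustration; the example also shows why the plain product is not usable on data sets of even moderate size.

```python
import math


def conditional_log_likelihood(true_class_probs):
    """Log of the conditional likelihood: sum_i log P(c_i | a_i, s_j)."""
    return sum(math.log(p) for p in true_class_probs)


# Hypothetical model assigning probability 0.5 to the true class of 2000 instances.
probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p  # the plain product underflows to zero in double precision

ll = conditional_log_likelihood(probs)  # finite, about -1386.3
```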

We call BMA-AODE the classifier which estimates the posterior probabilities of
the class, given the test instance $\mathbf{a}$, as follows:

$$P(c|\mathbf{a}) \propto \sum_{j=1}^{k} P(c|\mathbf{a}, s_j) \cdot L_j \cdot P(s_j). \tag{3}$$

Especially on large data sets, the difference between the likelihoods of the different
SPODEs might be of several orders of magnitude. We remove from the ensemble
the SPODEs whose conditional likelihood is smaller than $L_{max}/10^4$, where
$L_{max}$ is the maximum conditional likelihood among all SPODEs; discarding models
with very low posterior probability is in fact common when dealing with BMA; this
procedure can be seen as a belief revision (Dubois and Prade, 1997). Given the joint
beliefs $P(X, Y)$, the revision $P'(X, Y)$ induced by a marginal $P'(Y)$ is defined by
$P'(x, y) := P(x|y) \cdot P'(y)$. In other words, if $P'(y)$ is known to be a better model
than $P(y)$ for the marginal beliefs about $y$, this information can be used in the above
described way to redefine the joint. Accordingly, in BMA-AODE, the marginal beliefs
about $S$ have been replaced by better candidates, inducing a revision in the corresponding
joint model.
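The whole BMA-AODE aggregation can be sketched in a few lines of Python, working with conditional log-likelihoods. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the `ratio` parameter (with an assumed default of 1e4 for the pruning threshold) and the toy inputs are hypothetical.

```python
import math


def bma_aode_posterior(spode_posteriors, cond_log_liks, priors=None, ratio=1e4):
    """Class posteriors of an ensemble weighted by L_j * P(s_j).

    spode_posteriors[j][c] is P(c | a, s_j); cond_log_liks[j] is log L_j.
    SPODE j is dropped when L_j < L_max / ratio (assumed pruning rule).
    """
    k = len(cond_log_liks)
    if priors is None:
        priors = [1.0 / k] * k  # uniform prior over the SPODEs
    ll_max = max(cond_log_liks)
    # The pruning test is carried out in log space.
    kept = [j for j in range(k) if cond_log_liks[j] >= ll_max - math.log(ratio)]
    # Rescaling by L_max does not change the normalised result and avoids underflow.
    weights = {j: math.exp(cond_log_liks[j] - ll_max) * priors[j] for j in kept}
    total = sum(weights.values())
    n_classes = len(spode_posteriors[0])
    return [
        sum(spode_posteriors[j][c] * weights[j] for j in kept) / total
        for c in range(n_classes)
    ]


# Hypothetical inputs: three SPODEs, two classes.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
log_liks = [-1000.0, -1002.0, -1050.0]  # the third SPODE falls below the threshold
bma_post = bma_aode_posterior(posteriors, log_liks)
```

With these inputs the first SPODE receives most of the weight although its log-likelihood exceeds the second one by only two units, which illustrates the concentration of BMA around the most probable model.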

2.2.1. Exponentiation of the Log-Likelihoods 

Regardless of whether the marginal likelihood or the conditional likelihood is considered,
it is common to compute the log-likelihood rather than the likelihood, in order to
avoid numerical problems due to the multiplication of many probabilities. However, if
the log-likelihoods are very negative, as happens on large data sets, their exponentiation
can suffer numerical problems too. This issue has been addressed in Yang et al.
(2007) by means of high numerical precision: "BMA often lead to arithmetic overflow
when calculating very large exponentials or factorials. One solution is to use the Java
class BigDecimal which unfortunately can be very slow." Algorithm 1 describes a procedure
for exponentiating the log-likelihoods, which is both numerically robust and
computationally fast. The procedure has been communicated to us by D. Dash, who
published several works on BMA (Dash and Cooper, 2004).
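A possible Python rendering of this kind of shift-and-normalise procedure is given below. Note one deliberate difference, stated as an assumption of this sketch: the log-likelihoods are shifted by their maximum (the usual log-sum-exp convention) rather than by their minimum, so that every exponent is non-positive and no overflow can occur regardless of the spread of the values. The function name is hypothetical.

```python
import math


def normalized_likelihoods(log_liks):
    """Return weights proportional to exp(log_liks), summing to one."""
    shift = max(log_liks)  # assumption: shift by the max, so each exponent is <= 0
    tmp = [math.exp(ll - shift) for ll in log_liks]
    total = sum(tmp)
    return [t / total for t in tmp]


# Directly exponentiating -5000 would underflow to zero; the shifted version does not.
liks = normalized_likelihoods([-5000.0, -5001.0, -5003.0])
```

The result is invariant to the shift, since the common factor cancels in the normalisation.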



2.3. BMA-AODE*: Extending BMA-AODE to Sets of Probabilities 



By BMA-AODE* we extend BMA-AODE to imprecise probabilities (Walley, 1991),
allowing multiple specifications of the prior mass function $P(S)$; we denote the credal
set containing such prior mass functions as $\mathcal{P}(S)$. While a uniform prior represents
prior indifference between the different SPODEs, the credal set represents a condition
of prior ignorance about $S$, letting the prior probability of each SPODE vary within a
large range. In principle we could let the prior probability of each SPODE vary exactly
between zero and one (vacuous model). Yet, this would generate vacuous posterior
inferences, thus preventing learning from data (Piatti et al., 2009). To obtain non-vacuous
posterior inferences, we introduce non-zero lower bounds for the prior probability of



Algorithm 1 Robust exponentiation of log-likelihoods.

Require: Array log_liks of log-likelihoods, assumed of length k.

    minVal = min(log_liks)
    for i = 1 : k do
        shifted_logliks(i) = log_liks(i) - minVal;
        tmp_liks(i) = exp(shifted_logliks(i));
    end for
    total = sum(tmp_liks)
    for i = 1 : k do
        liks(i) = tmp_liks(i) / total;
    end for
    return liks {Array proportional to the exponentiated likelihoods}



the models. The resulting credal set is defined by the following constraints:

$$\mathcal{P}(S) := \left\{ P(S) : \ P(s_j) \geq \epsilon \quad \forall j = 1, \dots, k, \quad \sum_{j=1}^{k} P(s_j) = 1 \right\}. \tag{4}$$

The prior probability of each SPODE varies thus between $\epsilon$ and $1 - (k-1)\epsilon$. The set
of mass functions in Eq. (4) is convex; its $k$ extreme mass functions are those assigning
mass $\epsilon$ to all the SPODEs apart from a single one, to which $1 - (k-1)\epsilon$ is assigned. The
constant $\epsilon$ appears in other places of this paper; in the implementation we set $\epsilon = 0.01$
for all occurrences of $\epsilon$.

The credal set in (4) models the fact that, before observing the data, we are ignorant
about the probability of each SPODE to be the true model. Considering that $\mathcal{P}(S)$ is a
set of prior mass functions, BMA-AODE* can be regarded as a set of BMA-AODE
classifiers, each corresponding to a different prior. The most probable class of an instance
might happen to vary, when all the different priors of the credal set are considered;
in this case the classification is prior-dependent. When dealing with prior-dependent
instances, credal classifiers (Corani et al., 2012; Corani and Zaffalon, 2008b) become
indeterminate, by returning a set of classes instead of a single class.

Before discussing how this set of classes is identified, let us introduce the concept
of credal dominance (or, for short, dominance): class $c'$ dominates class $c''$ if $c'$ is
more probable than $c''$ under each prior of the credal set. If no class dominates $c'$,
then $c'$ is non-dominated. Credal classifiers return in particular all the non-dominated
classes, identified by performing pairwise dominance tests among classes.
This criterion is called maximality (Walley, 1991, Section 3.9.2) and is described by
Algorithm 2. We point the reader to (Troffaes, 2007) for a discussion of alternative
criteria for taking decisions under imprecise probabilities.

Algorithm 2 Identification of the non-dominated classes $\mathcal{ND}$ through maximality.

    ND := C
    for c' ∈ C do
        for c'' ∈ C (c'' ≠ c') do
            check whether c' dominates c''
            if c' dominates c'' then
                remove c'' from ND
            end if
        end for
    end for
    return ND



Non-dominated classes are incomparable, namely there is no available information 
to rank them. Credal classifiers can be thus seen as dropping the dominated classes and 
expressing indecision about the non-dominated ones. 
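The maximality loop can be sketched in Python as follows, with the pairwise dominance test passed in as a function. The toy test used in the example is an assumption made purely for illustration: it replaces the credal set by a finite list of priors, each with its own class posteriors, and declares dominance when one class is more probable than the other under every listed prior.

```python
def non_dominated(classes, dominates):
    """Return the classes that are not dominated by any other class.

    dominates(c1, c2) is a user-supplied pairwise test returning True or False.
    """
    nd = list(classes)
    for c1 in classes:
        for c2 in classes:
            if c1 != c2 and c2 in nd and dominates(c1, c2):
                nd.remove(c2)
    return nd


# Hypothetical class posteriors under two different priors.
posterior_by_prior = [
    {"x": 0.50, "y": 0.30, "z": 0.20},
    {"x": 0.35, "y": 0.45, "z": 0.20},
]


def toy_dominates(c1, c2):
    """c1 dominates c2 if it is more probable under every listed prior."""
    return all(p[c1] > p[c2] for p in posterior_by_prior)


result = non_dominated(["x", "y", "z"], toy_dominates)
```

In this example the most probable class changes from `x` to `y` between the two priors, so the instance is prior-dependent and both classes are returned, while `z` is dropped.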

Within BMA-AODE*, $c'$ dominates $c''$ if the solution of the following optimization
problem is greater than one:

$$\text{minimize:} \quad \frac{\sum_{j=1}^{k} P(c'|\mathbf{a}, s_j) \cdot L_j \cdot P(s_j)}{\sum_{j=1}^{k} P(c''|\mathbf{a}, s_j) \cdot L_j \cdot P(s_j)}$$

$$\text{subject to:} \quad P(s_j) \geq \epsilon \quad \forall j = 1, \dots, k,$$

$$\sum_{j=1}^{k} P(s_j) = 1.$$

Note that the constraints of the problem correspond to the definition of credal set given
in Eq. (4). The above optimization task is a fractional-linear program; it can be mapped
into a linear program by the Charnes-Cooper transformation (see Appendix A)
and then solved exactly.
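For small ensembles the dominance test can also be checked without a linear-programming solver. The sketch below swaps in a different technique from the Charnes-Cooper transformation: since a linear-fractional objective with positive denominator attains its minimum over a polytope at an extreme point, it simply evaluates the ratio at the $k$ extreme mass functions of the credal set (mass $\epsilon$ everywhere except one SPODE). The function name and inputs are hypothetical, and the denominator is assumed to be strictly positive.

```python
def bma_dominates(post_c1, post_c2, liks, eps=0.01):
    """Check whether class c' dominates class c'' by vertex enumeration.

    post_c1[j] = P(c' | a, s_j), post_c2[j] = P(c'' | a, s_j),
    liks[j] is proportional to the conditional likelihood L_j.
    """
    k = len(liks)
    best = float("inf")
    for top in range(k):
        # Extreme prior: eps on every SPODE except `top`.
        prior = [eps] * k
        prior[top] = 1.0 - (k - 1) * eps
        num = sum(p * l * w for p, l, w in zip(post_c1, liks, prior))
        den = sum(p * l * w for p, l, w in zip(post_c2, liks, prior))
        best = min(best, num / den)
    return best > 1.0


# Hypothetical inputs: c' wins under the uniform prior, but not under every prior.
prior_dependent = bma_dominates([0.7, 0.6, 0.4], [0.3, 0.4, 0.6], [1.0, 0.5, 0.2])
clear_cut = bma_dominates([0.7, 0.6, 0.8], [0.3, 0.4, 0.2], [1.0, 0.5, 0.2])
```

In the first call the prior concentrating on the third SPODE reverses the ranking of the two classes, so no dominance is reported; in the second call every SPODE prefers the first class and dominance holds.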

As already discussed for BMA-AODE, we include in the computation only the
SPODEs whose conditional likelihood is at least $L_{max}/10^4$. This can be regarded as a
belief revision process, involving the credal set. The marginal credal set $\mathcal{P}'(Y)$ induces
the following revision of the joint credal set $\mathcal{P}(X, Y)$:

$$\mathcal{P}'(X, Y) := \left\{ P'(X, Y) : \ P'(x, y) := P(x|y) \cdot P'(y), \quad P'(Y) \in \mathcal{P}'(Y) \right\}.$$

It is worth emphasizing that the prior credal set of BMA-AODE* includes the uniform
prior adopted by BMA-AODE; therefore, the set of non-dominated classes identified
by BMA-AODE* includes the most probable class returned by BMA-AODE; if in
particular BMA-AODE* returns a single non-dominated class, this coincides with the
class returned by BMA-AODE.

2.4. Compression-Based averaging 



Compression-based averaging has been introduced by (Boullé, 2007) as a remedy
against the tendency of BMA to get excessively concentrated around the most
probable model, which indeed deteriorates the performance (Boullé, 2007; Domingos,
2000). This approach replaces the posterior probabilities $P(s_j|\mathcal{D})$ of the models by
smoother compression weights, which we denote as $P'(s_j|\mathcal{D})$ for model $s_j$. Note that
also the adoption of the compression coefficients in place of the posterior probabilities
can be seen as a belief revision.

To present the method, we need some further notation. In particular, we denote by
$LL_j$ the log of the conditional likelihood of model $s_j$. We moreover introduce the null
classifier as a Bayesian network with no arcs, which models the class as independent
from the features and whose probabilistic classifications correspond to the marginal
probabilities of the classes. The null classifier will be used for the computation of the
compression coefficients. We denote the null classifier as $s_0$; therefore we associate a
further state $s_0$ to $S$, whose domain thus becomes $\{s_0, s_1, \dots, s_k\}$. We denote as $LL_0$
the conditional log-likelihood of the null classifier. It has been shown (Boullé, 2007)
that $LL_0 = -nH(C)$, where $H(C) := -\sum_{c \in \mathcal{C}} P(c) \log P(c)$ is the entropy² of the
class.

Since we are dealing with a traditional single-prior classifier, we set a single prior
mass function over the models, assigning uniform prior probability to the various
SPODEs but prior probability $\epsilon$ to the null model; assigning a prior probability to the
null model is necessary, since its posterior probability appears in the compression
coefficients. Thus, we define the prior over variable $S$ as follows:

$$P(s_j) = \begin{cases} \epsilon & j = 0, \\ \dfrac{1 - \epsilon}{k} & j = 1, \dots, k. \end{cases} \tag{5}$$



The compression coefficients are computed in two steps: computation of the raw
compression coefficients and normalization. The raw compression coefficient associated to
SPODE $s_j$ is:

$$\pi_j := 1 - \frac{\log P(s_j|\mathcal{D})}{\log P(s_0|\mathcal{D})} = 1 - \frac{LL_j + \log P(s_j)}{LL_0 + \log P(s_0)} = 1 - \frac{LL_j + \log \frac{1-\epsilon}{k}}{-nH(C) + \log \epsilon}. \tag{6}$$
A negative $\pi_j$ means that $s_j$ is a worse predictor than the null model; a positive $\pi_j$
means that $s_j$ is a better predictor than the null model, which is the general case in
practical situations. The upper limit of $\pi_j$ is one: in this case $s_j$ is a perfect predictor,
with likelihood 1, and thus log-likelihood 0. Following (Boullé, 2007), we keep in the
ensemble only the feasible models, namely those with $\pi_j > 0$; we instead discard the
models with $\pi_j \leq 0$. Also this procedure corresponds to a belief revision induced by
the removal from the ensemble of the models whose posterior probability falls below
a certain threshold. Note also that, since $\pi_0 = 0$ by definition, the null model is not
part of the resulting ensemble. The compression coefficients can be justified as follows
(Boullé, 2007): $LL_j + \log P(s_j)$ "represents the quantity of information required
to encode the model plus the class values given the model. The code length of the
null model can be interpreted as the quantity of information necessary to describe the
classes, when no explanatory data is used to induce the model. Each model can potentially
exploit the explanatory data to better compress the class conditional information.
The ratio of the code length of a model to that of the null model stands for a relative
gain in compression efficiency."

²For this equivalence to hold, it is necessary to compute the entropy using the natural
logarithm, instead of the log2 as usual.

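With no loss of generality, assume the features to be ordered so that $A_1, A_2, \dots, A_{\hat{k}}$
yield a feasible model when used as super-parent; thus, SPODEs $s_1, s_2, \dots, s_{\hat{k}}$ are
feasible, while SPODEs $s_j$ with $j > \hat{k}$ are removed from the ensemble. The normalized
compression coefficients $P'(s_j|\mathcal{D})$ are obtained by normalizing the raw compression
coefficients of the feasible SPODEs:

$$P'(s_j|\mathcal{D}) = \begin{cases} \dfrac{\pi_j}{\sum_{i=1}^{\hat{k}} \pi_i} & \text{if } j = 1, \dots, \hat{k}, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$

The posterior probabilities are estimated as:

$$P(c|\mathbf{a}) = \sum_{j=1}^{\hat{k}} P(c|\mathbf{a}, s_j) \cdot P'(s_j|\mathcal{D}). \tag{8}$$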



We call this classifier COMP-AODE, where COMP stands for compression-based.
COMP-AODE performs a weighted linear combination of probabilities estimated by
different models; in risk analysis, a weighted linear combination of probabilities
estimated by different experts is referred to as linear opinion pool (Clemen and Winkler,
1999).
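The computation of the compression-based weights can be sketched as follows. The function name and the numeric inputs are hypothetical; natural logarithms are used throughout, and the sketch assumes that at least one SPODE is feasible, so that the normalising sum is positive.

```python
import math


def compression_weights(cond_log_liks, n, class_entropy, eps=0.01):
    """Normalised compression coefficients of k SPODEs.

    cond_log_liks[j] is LL_j, n is the training-set size and class_entropy
    is H(C) computed with the natural logarithm.
    """
    k = len(cond_log_liks)
    log_prior = math.log((1.0 - eps) / k)        # uniform prior on the SPODEs
    denom = -n * class_entropy + math.log(eps)   # LL_0 + log P(s_0), with LL_0 = -n H(C)
    raw = [1.0 - (ll + log_prior) / denom for ll in cond_log_liks]
    feasible = [max(pi, 0.0) for pi in raw]      # non-feasible SPODEs get weight zero
    total = sum(feasible)
    return [pi / total for pi in feasible]


# Hypothetical data set: 1000 instances, two balanced classes, three SPODEs.
w = compression_weights([-300.0, -350.0, -800.0], n=1000, class_entropy=math.log(2))
```

With these inputs the two feasible SPODEs receive weights of roughly 0.53 and 0.47, whereas weights proportional to the likelihoods would put essentially all the mass on the first SPODE (a gap of 50 in log-likelihood); the third SPODE is worse than the null model and is discarded.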



2.5. COMP-AODE*: Extending COMP-AODE to Sets of Probabilities 

We extend COMP-AODE to imprecise probabilities by allowing for multiple
specifications of the prior $P(S)$ over the models, collected into a credal set $\mathcal{P}_c(S)$, where
the subscript denotes compression. Differently from the credal set $\mathcal{P}(S)$ used by
BMA-AODE*, here we also consider the null model. We assign to the null model a
fixed prior probability $\epsilon$, while the prior probabilities of the SPODEs are free to vary
under constraints analogous to those of BMA-AODE*; in this way we model a condition
of prior ignorance. The credal set $\mathcal{P}_c(S)$ adopted by COMP-AODE* is therefore:

$$\mathcal{P}_c(S) := \left\{ P(S) : \ P(s_0) = \epsilon, \quad P(s_j) \geq \epsilon \quad \forall j = 1, \dots, k, \quad \sum_{j=0}^{k} P(s_j) = 1 \right\}. \tag{9}$$

The bounds of the raw compression coefficient defined in Eq. (6) are obtained by
letting $P(S)$ vary in $\mathcal{P}_c(S)$:

$$\pi_j \in \left[ 1 - \frac{LL_j + \log \epsilon}{-nH(C) + \log \epsilon}, \ \ 1 - \frac{LL_j + \log (1 - k\epsilon)}{-nH(C) + \log \epsilon} \right]. \tag{10}$$

Since the prior used by COMP-AODE (Eq. (5)) belongs to the credal set of
COMP-AODE*, the point estimate of the compression coefficient adopted by COMP-AODE
(Eq. (6)) lies in the above interval. Note also that the upper bound of the above interval
(upper coefficient of compression) is obtained in correspondence of the extreme mass
function which assigns prior probability $1 - k\epsilon$ to model $s_j$ and prior probability $\epsilon$ to
all the remaining models. The various $\pi_j$ cannot vary independently from each other;
they are instead linked by the normalization constraint in Eq. (9).
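The interval of the raw compression coefficient can be computed directly from its definition, by plugging in the smallest prior ($\epsilon$) and the largest prior ($1 - k\epsilon$) that a SPODE can receive. The following sketch uses a hypothetical function name and hypothetical numbers, and checks that the single-prior coefficient falls inside the interval.

```python
import math


def compression_bounds(ll_j, n, class_entropy, k, eps=0.01):
    """Lower and upper raw compression coefficient of a SPODE over the credal set."""
    denom = -n * class_entropy + math.log(eps)  # code length term of the null model
    lower = 1.0 - (ll_j + math.log(eps)) / denom            # prior mass eps
    upper = 1.0 - (ll_j + math.log(1.0 - k * eps)) / denom  # prior mass 1 - k * eps
    return lower, upper


# Hypothetical SPODE: LL_j = -300 on 1000 instances, two balanced classes, k = 3.
n, h, k, eps = 1000, math.log(2), 3, 0.01
lower, upper = compression_bounds(-300.0, n, h, k, eps)
# Point estimate under the single prior (1 - eps) / k on each SPODE.
point = 1.0 - (-300.0 + math.log((1.0 - eps) / k)) / (-n * h + math.log(eps))
```

On a data set of this size the interval is narrow (about 0.563 to 0.570), since the log-prior terms are small compared with the log-likelihoods.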

COMP-AODE* regards SPODE $s_j$ as non-feasible if the upper coefficient of
compression is non-positive: this approach thus preserves all the models which are
feasible, in the sense of Section 2.4, for at least a prior in the set $\mathcal{P}_c(S)$. COMP-AODE*
is thus more conservative than COMP-AODE, namely it discards a lower number of
models. However, generally neither COMP-AODE* nor COMP-AODE remove any
SPODE from the ensemble. Since the prior adopted by COMP-AODE is contained in
the credal set of COMP-AODE*, the most probable class identified by COMP-AODE
is part of the non-dominated classes identified by COMP-AODE*.³

Like BMA-AODE*, COMP-AODE* identifies for each instance the non-dominated
classes through maximality (Algorithm 2). In the following, we explain how to
compute the test of dominance between two classes.

Testing dominance 

Without loss of generality, we assume the features to have been re-ordered, so that
the first $\hat{k}$ features yield a model with positive upper coefficient of compression when
used as super-parent; in other words, SPODEs $\{s_1, \dots, s_{\hat{k}}\}$ are the feasible ones. In
this case the dominance test corresponds to evaluating whether or not the solution of the
following optimization problem is greater than one.

³Exceptions to this statement are in principle possible if the set of feasible SPODEs differs between
COMP-AODE* and COMP-AODE. However, this did not happen in our extensive experiments.

$$\text{minimize:} \quad \frac{P(c'|\mathbf{a})}{P(c''|\mathbf{a})} = \frac{\sum_{j=1}^{\hat{k}} P(c'|\mathbf{a}, s_j) \cdot \pi_j}{\sum_{j=1}^{\hat{k}} P(c''|\mathbf{a}, s_j) \cdot \pi_j} \tag{11}$$

$$\text{w.r.t.:} \quad P(s_0), P(s_1), \dots, P(s_k) \tag{12}$$

$$\text{subject to:} \quad P(s_0) = \epsilon \tag{13}$$

$$P(s_j) \geq \epsilon \quad \forall j = 1, \dots, k$$

$$\sum_{j=0}^{k} P(s_j) = 1,$$

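where the normalization term $\sum_{j=1}^{\hat{k}} \pi_j$ has been already simplified, being positive by
definition. Recalling that $P(s_0) = \epsilon$ and introducing Eq. (6), which shows how $\pi_j$
depends on the optimization variable $P(s_j)$, we rewrite the objective function as:

$$\frac{\sum_{j=1}^{\hat{k}} P(c'|\mathbf{a}, s_j) \cdot \left( 1 - \frac{LL_j + \log P(s_j)}{LL_0 + \log \epsilon} \right)}{\sum_{j=1}^{\hat{k}} P(c''|\mathbf{a}, s_j) \cdot \left( 1 - \frac{LL_j + \log P(s_j)}{LL_0 + \log \epsilon} \right)}$$

and hence

$$\frac{\sum_{j=1}^{\hat{k}} P(c'|\mathbf{a}, s_j) \cdot (\log \epsilon + LL_0 - LL_j) - \sum_{j=1}^{\hat{k}} P(c'|\mathbf{a}, s_j) \cdot \log P(s_j)}{\sum_{j=1}^{\hat{k}} P(c''|\mathbf{a}, s_j) \cdot (\log \epsilon + LL_0 - LL_j) - \sum_{j=1}^{\hat{k}} P(c''|\mathbf{a}, s_j) \cdot \log P(s_j)}.$$

We then introduce the constants $a := \sum_{j=1}^{\hat{k}} P(c'|\mathbf{a}, s_j) (\log \epsilon + LL_0 - LL_j)$,
$b := \sum_{j=1}^{\hat{k}} P(c''|\mathbf{a}, s_j) (\log \epsilon + LL_0 - LL_j)$, $\alpha_j := P(c'|\mathbf{a}, s_j)$, $\beta_j := P(c''|\mathbf{a}, s_j)$. After
changing the sign of both numerator and denominator of the objective function, we
rewrite the optimization problem, with respect to the variables $x_1, x_2, \dots, x_{\hat{k}}$, where
$x_j := \log P(s_j)$, as follows:

$$\text{minimize:} \quad \frac{-a + \sum_{j=1}^{\hat{k}} \alpha_j x_j}{-b + \sum_{j=1}^{\hat{k}} \beta_j x_j}$$

$$\text{w.r.t.:} \quad x_1, \dots, x_{\hat{k}}$$

$$\text{subject to:} \quad x_j \geq \log \epsilon \quad \forall j = 1, \dots, \hat{k}$$

$$\sum_{j=1}^{\hat{k}} \exp x_j = 1 - \epsilon - (k - \hat{k}) \epsilon,$$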

where the constraints are derived from the definition of credal set (9). The last
constraint is justified as follows: $(k - \hat{k})$ models have been removed from the ensemble
as unfeasible and therefore they do not appear in the optimization problem. Without
changing the credal set, we set their priors to $\epsilon$; since these models do not impact on
the objective function, the best solution is attained by allocating to them the minimum
possible prior probability. We then substitute $y_j := \exp x_j$ to avoid numerical
problems in the optimization, thus getting the following non-linear optimization problem
with linear constraints:

$$\text{minimize:} \quad \frac{-a + \sum_{j=1}^{\hat{k}} \alpha_j \log y_j}{-b + \sum_{j=1}^{\hat{k}} \beta_j \log y_j} \tag{14}$$

$$\text{w.r.t.:} \quad y_1, \dots, y_{\hat{k}} \tag{15}$$

$$\text{subject to:} \quad y_j \geq \epsilon \quad \forall j = 1, \dots, \hat{k}, \tag{16}$$

$$\sum_{j=1}^{\hat{k}} y_j = 1 - \epsilon - (k - \hat{k}) \epsilon.$$

2.6. Computational Complexity of the Classifiers 

We now analyze the computational complexity of the proposed classifiers and
compare it with that of the standard AODE. We distinguish between learning and
classification complexity, the latter referring to the classification of a single instance. Both
the space and the time required for computations are evaluated. The orders of magnitude
of these descriptors are reported as a function of the dataset size $n$, the number of
attributes/SPODEs $k$, the number of classes $l := |\mathcal{C}|$, and average number of states for
the attributes $v := k^{-1} \sum_{i=1}^{k} |\mathcal{A}_i|$. A summary of this analysis is in Table 1 and the
discussion below.

Let us first evaluate the AODE. For a single SPODE $s_j$, the tables $P(C)$, $P(A_j|C)$
and $P(A_i|C, A_j)$, with $i = 1, \dots, k$ and $i \neq j$ should be stored, this implying space
complexity $O(lkv^2)$ for learning each SPODE and $O(lk^2v^2)$ for the AODE ensemble.

Algorithm                 Space                      Time             Time
                          learning/classification    learning         classification

AODE                      $O(lk^2v^2)$               $O(nk^2)$        $O(lk^2)$
BMA-AODE/COMP-AODE        $O(lk^2v^2)$               $O(n(l+k)k)$     $O(lk^2)$
BMA-AODE*/COMP-AODE*      $O(lk^2v^2)$               $O(n(l+k)k)$     $O(l^2k^3)$

Table 1: Complexity of classifiers.

These tables should be available during learning and classification for both classifiers; 
thus, space requirements of these two stages are the same. 

Time complexity to scan the dataset and learn the probabilities is 0{nk) for each 
SPODE, and hence 0{nk^) for the AODE. The time required to compute the posterior 
probabihties is 0{lk) for each SPODE, and hence 0{lk'^) for AODE. 

Learning BM A- AODE or COMP-AODE takes the same space as AODE, but higher 
computational time, due to the evaluation of the conditional likehhood as in Eq.©. The 
additional computational time is 0{nlk), thus requiring 0{n{l + k)k) time overall. For 
classification, time and space complexity during learning and classification are just the 
same. 

The credal classifiers BMA-AODE* and COMP-AODE* have the same space
complexity and the same learning time complexity as their non-credal counterparts.
However, credal classifiers have higher time complexity in classification. The pairwise
dominance tests in Algorithm 2 require, for each test instance, the solution of a number
of optimization problems which is quadratic in the number of classes. We can
roughly describe as cubic in the number of variables the time complexity of solving the
linear programming problem for BMA-AODE* and of optimizing the non-linear
function, with linear constraints, for COMP-AODE*. Summing up, compared to their
single-prior counterparts, credal classifiers increase by one the exponents of the
number of classes and of the number of attributes in the time complexity of the
classification stage.
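To get a feel for the orders of magnitude in Table 1, the leading terms can be evaluated for a given data set. The following is a minimal sketch: the function name and the dictionary keys are illustrative, the input values are made up, and the returned numbers are orders of magnitude with constants dropped, not exact operation counts.

```python
def complexity_orders(n, k, l, v):
    """Leading-order terms of Table 1 (all constants dropped).

    n: data set size, k: number of attributes/SPODEs,
    l: number of classes, v: average number of states per attribute.
    """
    space = l * k**2 * v**2          # same for learning and classification
    ensemble_learning = n * (l + k) * k
    return {
        "AODE": {"space": space, "learn": n * k**2, "classify": l * k**2},
        "BMA-AODE/COMP-AODE": {
            "space": space, "learn": ensemble_learning, "classify": l * k**2},
        "BMA-AODE*/COMP-AODE*": {
            "space": space, "learn": ensemble_learning, "classify": l**2 * k**3},
    }


# Hypothetical data set: 1000 instances, 10 attributes, 2 classes, 3 states each.
orders = complexity_orders(n=1000, k=10, l=2, v=3)
```

With these made-up values, the only term that changes when moving from the determinate to the credal classifiers is the classification time, which grows by a factor of $lk$.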

3. Experiments 

We run experiments on 40 data sets, whose characteristics are given in the Appendix
(Table B.2). On each data set we perform 10 runs of 5-fold cross-validation. In
order to have complete data, we replace missing values with the median for numerical
features and with the mode for categorical features. We discretize numerical features
by the entropy-based method of (Fayyad and Irani, 1993). For pairwise comparison
of classifiers over the collection of data sets we use the non-parametric Wilcoxon
signed-rank test. The Wilcoxon signed-rank test is indeed recommended for comparing
two classifiers on multiple data sets (Demšar, 2006): being non-parametric, it avoids
strong assumptions and robustly deals with outliers.
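The paired comparison described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the experiments: it computes the signed-rank sums and a two-sided p-value from the normal approximation (without continuity or tie correction, so it is only indicative for small samples), and the accuracy vectors are made-up numbers. All function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Signed-rank sums and normal-approximation p-value for paired vectors."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, w_minus, p_value


# Made-up accuracies (in %) of two classifiers on five data sets;
# the same position in both vectors refers to the same data set.
acc_a = [83, 79, 91, 75, 88]
acc_b = [81, 80, 87, 70, 88]
w_plus, w_minus, p_value = wilcoxon_signed_rank(acc_a, acc_b)
```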

3.1. Determinate classifiers 

We call determinate the classifiers which always return a single class, namely
AODE, BMA-AODE and COMP-AODE. For determinate classifiers we use two
indicators: the accuracy, namely the percentage of correct classifications, and the Brier
loss

$$\frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \left(1 - P(c^{(i)}|\mathbf{a}^{(i)})\right)^2,$$

where $n_{te}$ denotes the number of instances in the test set, while
$P(c^{(i)}|\mathbf{a}^{(i)})$ is the probability estimated by the classifier for the true
class of the $i$-th instance. The Brier loss assesses the quality of the estimated
probabilities in a more sensitive way than accuracy.
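As an illustration of the two indicators, the sketch below computes them from a list of posterior distributions. It takes the Brier loss as the mean squared shortfall of the probability assigned to the true class; the data layout (one dictionary of class probabilities per test instance) and the numbers are assumptions made for the example.

```python
def accuracy(true_classes, posteriors):
    """Fraction of instances whose most probable class is the true one."""
    hits = sum(1 for c, p in zip(true_classes, posteriors)
               if max(p, key=p.get) == c)
    return hits / len(true_classes)


def brier_loss(true_classes, posteriors):
    """Mean of (1 - probability assigned to the true class) squared."""
    total = sum((1.0 - p[c]) ** 2 for c, p in zip(true_classes, posteriors))
    return total / len(true_classes)


# Made-up test set with two instances and two classes.
true_classes = ["a", "b"]
posteriors = [{"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.4}]

acc = accuracy(true_classes, posteriors)
brier = brier_loss(true_classes, posteriors)
```

In this toy case the first instance is classified correctly and the second is not; the Brier loss additionally penalizes the first prediction for assigning only 0.8 to the true class, which accuracy cannot see.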

A first finding is that AODE outperforms BMA-AODE, having both higher accuracy
(p-value < .01) and lower Brier loss. We present in Figure 3(a) the scatter plot of
accuracies and in Figure 4(a) the relative Brier losses, namely the Brier loss of
BMA-AODE divided, data set by data set, by the Brier loss of AODE. On average,
BMA-AODE has 3% higher Brier loss than AODE. The fact that AODE outperforms
BMA-AODE could be expected; the same finding was already given in (Yang et al., 2007)
and in (Cerquides et al., 2005), with the main difference that the BMA-AODE of these
works was based on the marginal likelihood rather than on the conditional likelihood.
Our results show that BMA-AODE is outperformed by AODE, even when using the
conditional likelihood. BMA-AODE is outperformed by AODE both because of its
excessive concentration around the most probable model (Boullé, 2007; Cerquides et al.,
2005; Domingos, 2000; Minka, 2002), which tends to cancel the advantage of averaging
over models, and because of the effectiveness of simply averaging over SPODEs,
as done by AODE, in terms of reduction of the variance error.

Note on the Wilcoxon signed-rank test: for each indicator of performance, we generate
two paired vectors; the same position in both vectors refers to the same data set. The
two vectors are then used as input for the test.

As outlined by Figure 3(b), the difference between accuracies is instead not significant
when comparing COMP-AODE and AODE. However, COMP-AODE outperforms
AODE on the Brier loss (p-value < .01); in Figure 4(b) we show the relative Brier
losses, namely the Brier loss of COMP-AODE divided, data set by data set, by the
Brier loss of AODE. Averaging over data sets, COMP-AODE reduces the Brier loss by
about 3% compared to AODE. We see this result as noteworthy, since AODE is a
high-performance classifier. These positive results with the compression-based approach
broaden the scope of the experiments of (Boullé, 2007), in which the compression
approach was applied to an ensemble of naive Bayes classifiers.



Figure 3: Scatter plots of accuracies (panel (a): BMA-AODE; panel (b): COMP-AODE);
the solid line shows the bisector.

Figure 4: Relative Brier losses (panels (a) and (b)); points lying below the horizontal
line represent performance better than AODE, and vice versa. Note the different y-scales
of the two graphs.

3.2. Credal classifiers

A credal classifier can be seen as separating the instances into two groups: the
safe ones, for which it returns a single class, and the prior-dependent ones,
for which it returns two or more classes. Note that prior-dependence is not an
intrinsic property of the instance: an instance can be judged as prior-dependent by a
certain credal classifier and as safe by a different credal classifier. To characterize
the performance of a credal classifier, the following four indicators are considered
(Corani and Zaffalon, 2008b):

• determinacy: % of instances recognized as safe, namely classified with a single
class;

• single-accuracy: the accuracy achieved over the instances recognized as safe;

• set-accuracy: the accuracy achieved, by returning a set of classes, over the
prior-dependent instances;

• indeterminate output size: the average number of classes returned on the
prior-dependent instances.
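The four indicators listed above can be computed directly from set-valued predictions. The sketch below is only illustrative: the function name, the representation of each prediction as a Python set, and the toy data are assumptions, and indicators that are undefined (for instance set-accuracy when no instance is prior-dependent) are returned as `None`.

```python
def credal_indicators(true_classes, predicted_sets):
    """Determinacy, single-accuracy, set-accuracy, indeterminate output size."""
    pairs = list(zip(true_classes, predicted_sets))
    safe = [(c, s) for c, s in pairs if len(s) == 1]
    prior_dependent = [(c, s) for c, s in pairs if len(s) > 1]

    def hit_rate(group):
        # Fraction of predictions containing the true class.
        if not group:
            return None
        return sum(1 for c, s in group if c in s) / len(group)

    output_size = None
    if prior_dependent:
        output_size = sum(len(s) for _, s in prior_dependent) / len(prior_dependent)

    return {
        "determinacy": len(safe) / len(pairs),
        "single_accuracy": hit_rate(safe),
        "set_accuracy": hit_rate(prior_dependent),
        "indeterminate_output_size": output_size,
    }


# Made-up predictions on four instances: two safe, two prior-dependent.
true_classes = ["a", "b", "c", "a"]
predicted_sets = [{"a"}, {"a"}, {"b", "c"}, {"b", "c"}]
indicators = credal_indicators(true_classes, predicted_sets)
```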

Averaging over data sets, BMA-AODE* has 94% determinacy; it is completely
determinate on 7 data sets. However, this determinacy fluctuates among data sets,
showing for instance a significant correlation with the sample size $n$ ($\rho = 0.3$). The
choice of the prior is less important on large data sets: bigger data sets tend to contain
a lower percentage of prior-dependent instances, thus increasing determinacy.
BMA-AODE* performs well when indeterminate: averaging over all data sets, it achieves
90% set-accuracy by returning 2.3 classes (the average number of classes in the
collection of data sets is 3.6). It is worth analyzing the performance of BMA-AODE on
the prior-dependent instances. In Figure 5(a) we compare, data set by data set, the
accuracy achieved by BMA-AODE on the instances judged respectively as safe and
as prior-dependent by BMA-AODE*; the plot shows a sharp drop of accuracy on the
prior-dependent instances, which is statistically significant (p-value < .01). As a rough
indication, averaging over data sets, the accuracy of BMA-AODE is 83% on the safe
instances but only 52% on the instances recognized as prior-dependent by BMA-AODE*.
Thus, on the prior-dependent instances, BMA-AODE provides fragile classifications;
on the same instances, BMA-AODE* returns a small-sized but highly accurate set of
classes.



Figure 5: Accuracy of the determinate classifiers on the instances recognized as safe and
as prior-dependent by their credal counterparts (panel (a): accuracy of BMA-AODE;
panel (b): accuracy of COMP-AODE; horizontal axis: prior-dependent instances). The
accuracy of BMA-AODE [COMP-AODE] is thus separately measured on the instances
judged safe and prior-dependent by BMA-AODE* [COMP-AODE*]. The solid line shows the
bisector.



Let us now analyze the performance of COMP-AODE*. It has higher determinacy
than BMA-AODE*: averaging over data sets, its determinacy is 99%, with only minor
fluctuations across data sets; the classifier is moreover completely determinate on 18
data sets. Therefore, under the compression-based approach only a small fraction of the
instances is prior-dependent; this robustness to the choice of the prior is likely to
contribute to the good performance of compression-based ensembles of classifiers and
constitutes a desirable but previously unknown property of the compression-based
approach. Numerical inspection shows that the logarithmic smoothing of the models'
posterior probabilities indeed makes the compression weights only weakly sensitive to
the choice of the prior. COMP-AODE* performs well when indeterminate: averaging over
all data sets, it achieves 95% set-accuracy by returning 2 classes (note that the
indeterminate output size cannot be less than two).

Again, it is worth checking the behavior of the corresponding determinate classifier,
namely COMP-AODE, on the instances that are prior-dependent for COMP-AODE*.
In Figure 5(b) we compare, data set by data set, the accuracy achieved
by COMP-AODE on the instances judged respectively safe and prior-dependent by
COMP-AODE*; there is a large drop of accuracy on the prior-dependent instances,
and the drop is significant (p-value < .01). Averaging over data sets, the accuracy of
COMP-AODE drops from 82% on the safe instances to only 47% on the instances
judged as prior-dependent by COMP-AODE*. Even COMP-AODE, despite its
robustness to the specification of the prior, undergoes a severe loss of accuracy on the
instances recognized as prior-dependent by COMP-AODE*. On the very same
instances, COMP-AODE* returns a small-sized but highly reliable set of classes, thus
enhancing the overall classification reliability.

3.3. Utility-based Measures 

We have seen so far that the credal classifiers extend their determinate counterparts
in a sensible way, being able to recognize prior-dependent instances and to deal robustly
with them. Yet, it is not obvious how to compare credal and determinate classifiers
by means of a synthetic indicator. In fact, fairly comparing determinate and
indeterminate predictions is very challenging; to the best of our knowledge, a
satisfactory solution exists only for 0-1 loss, while comparing determinate and
indeterminate predictions in a cost-sensitive setting, in which different kinds of errors
imply different costs, is still an open problem. In the following we thus reason under
0-1 loss. The discounted accuracy rewards a prediction made of $m$ classes with $1/m$ if
it contains the true class, and with 0 otherwise. Discounted accuracy is then compared
to the accuracy achieved by a determinate classifier. A theoretical justification for
discounted accuracy has been given by Zaffalon et al. (2011), showing that, within a
betting framework based on fairly general assumptions, discounted accuracy is the only
score which satisfies some fundamental properties for assessing both determinate and
indeterminate classifications.

Yet, Zaffalon et al. (2011) also show some severe limits of discounted accuracy, which
we illustrate by means of an example: we consider two different medical doctors, doctor
random and doctor vacuous, who should decide whether a patient is healthy or
diseased. Doctor random issues random diagnoses, using a uniform distribution over
the two categories. Doctor vacuous instead always returns both categories, admitting
to be ignorant. Let us assume that the hospital profits a quantity of money
proportional to the discounted accuracy achieved by its doctors at each visit. Both
doctors have the same expected discounted accuracy for each visit, namely 1/2. For the
hospital, both doctors provide the same expected profit on each visit, but with a
substantial difference: the profit of doctor vacuous is deterministic, while the profit of
doctor random is affected by considerable variance. Any risk-averse hospital manager
should thus prefer doctor vacuous over doctor random, since it yields the same expected
profit with less variance. In fact, under risk-aversion, the expected utility increases
with the expectation of the rewards and decreases with their variance (Levy and
Markowitz, 1979). To capture this point it is necessary to introduce a utility function,
to be then applied to the discounted-accuracy score assigned to each instance.

In Zaffalon et al. (2011) the utility function is designed as follows: the utility of a
correct and determinate classification (discounted accuracy 1) is 1; the utility of a
wrong classification (discounted accuracy 0) is 0; the utility of an accurate but
indeterminate classification consisting of two classes (discounted accuracy 0.5) is
assumed to lie between 0.65 and 0.8. Two quadratic utility functions are then derived
corresponding to these boundary values, passing respectively through
$\{u(0) = 0, u(0.5) = 0.65, u(1) = 1\}$ and $\{u(0) = 0, u(0.5) = 0.8, u(1) = 1\}$, and
denoted as $u_{65}$ and $u_{80}$ respectively. Their mathematical expressions are
$u_{65}(x) = -0.6x^2 + 1.6x$ and $u_{80}(x) = -1.2x^2 + 2.2x$, where $x$ is the value of
discounted accuracy. Since $u(1) = 1$, utility and accuracy coincide for determinate
classifiers; therefore, the utility of credal classifiers and the accuracy of determinate
classifiers can be directly compared. In del Coz and Bahamonde (2009), classifiers which
return indeterminate classifications are scored through the $F_1$ metric, originally
designed for Information Retrieval tasks. The $F_1$ metric, when applied to
indeterminate classifications, returns a score which always lies between $u_{65}$ and
$u_{80}$, further confirming the reasonableness of both utility functions. More details
on the links between $F_1$, $u_{65}$ and $u_{80}$ are given in Zaffalon et al. (2012).
We remark that in real applications the utility function should be elicited by
discussion with the decision maker; in this paper we use $u_{65}$ and $u_{80}$ to
model two reasonable but different degrees of risk-aversion.

Figure 6: Relative utilities of credal classifiers compared to their precise counterparts.

We now analyze the utilities generated by the various classifiers, comparing each
credal classifier with its determinate counterpart. BMA-AODE* has significantly higher
utility (p-value < .01) than BMA-AODE under both $u_{65}$ and $u_{80}$. This confirms
that extending the model to imprecise probability is a sensible approach. In the first
row of Figure 6 we show the relative utility, namely the utility of BMA-AODE* divided,
data set by data set, by the utility (i.e., accuracy) of BMA-AODE; the two plots refer
respectively to $u_{65}$ and $u_{80}$. Averaging over data sets, the improvement of
utility is about 1% and 2% under $u_{65}$ and $u_{80}$ respectively; although the
improvement might look small, we recall that it is obtained by modifying the
classifications of the prior-dependent instances only, 6% of the total on average. If we
focus on the prior-dependent instances only, the increase of utility generally varies
between +10% and +40%, depending on the data set and on the utility function. Clearly,
the improvement is larger under $u_{80}$, which assigns higher utility than $u_{65}$ to
the indeterminate but accurate classifications.

The analysis is similar when comparing COMP-AODE* with COMP-AODE. In
the second row of Figure 6 we show the relative utility, namely the utility of
COMP-AODE* divided, data set by data set, by the utility (i.e., accuracy) of COMP-AODE.
The increase of utility is in this case generally under 1%, as a consequence of the higher
determinacy of COMP-AODE* (99% on average), which allows less room for improving
utility through indeterminate classifications. In fact, the robustness of COMP-AODE
to the choice of the prior reduces the portion of instances where it is necessary
to make the classification indeterminate. Focusing however on the (rare) indeterminate
instances, the increase of utility deriving from the extension to imprecise probability
lies between 39% and 60%, depending on the data set and on the utility function.
Overall, COMP-AODE* has significantly (p-value < .01) higher utility than COMP-AODE
under both $u_{65}$ and $u_{80}$; also in this case the extension to the credal paradigm
is beneficial.

The utilities of COMP-AODE* and BMA-AODE* are also compared; under $u_{65}$,
COMP-AODE* yields significantly (p-value < .05) higher utility than BMA-AODE*,
while under $u_{80}$ the difference between the two classifiers is not significant,
although the utility generated by COMP-AODE* is generally slightly higher. The point is
that BMA-AODE* is more often indeterminate than COMP-AODE*; under $u_{80}$ the
indeterminate but accurate classifications are rewarded more than under $u_{65}$, thus
allowing BMA-AODE* to almost close the gap with COMP-AODE*. We conclude however
that COMP-AODE* should be generally preferred over BMA-AODE*.

Finally, we point out that COMP-AODE* generates significantly (p-value < .01)
higher utility than AODE, under both $u_{65}$ and $u_{80}$. The extension to imprecise
probability has thus concretely improved the overall performance of the
compression-based ensemble: recall that the determinate COMP-AODE yields better
probability estimates but not better accuracy than AODE.
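The utility-based measures of this section can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the helper names are hypothetical, and the quadratic utilities are obtained by fitting $u(x) = ax^2 + bx$ to the three stated conditions $u(0) = 0$, $u(1) = 1$ and $u(0.5) = u_{half}$, which gives $b = 4u_{half} - 1$ and $a = 1 - b$. The example reproduces the comparison between doctor random and doctor vacuous.

```python
def discounted_accuracy(true_class, predicted_set):
    """1/m if the set of m returned classes contains the true class, else 0."""
    return 1.0 / len(predicted_set) if true_class in predicted_set else 0.0


def quadratic_utility(u_half):
    """Quadratic u with u(0) = 0, u(0.5) = u_half and u(1) = 1."""
    b = 4.0 * u_half - 1.0
    a = 1.0 - b
    return lambda x: a * x * x + b * x


u65 = quadratic_utility(0.65)  # coefficients a = -0.6, b = 1.6
u80 = quadratic_utility(0.80)  # coefficients a = -1.2, b = 2.2

# Doctor random is right or wrong with probability 1/2 (two classes);
# doctor vacuous always returns both classes.
random_scores = [1.0, 0.0]
vacuous_score = discounted_accuracy("healthy", {"healthy", "diseased"})

expected_score_random = sum(random_scores) / 2
expected_u65_random = sum(u65(s) for s in random_scores) / 2
u65_vacuous = u65(vacuous_score)
u80_vacuous = u80(vacuous_score)
```

Both doctors have expected discounted accuracy 1/2, but once either utility function is applied doctor vacuous scores higher than doctor random, which is the risk-averse preference described in the text.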

3.4. Comparison with previous credal classifiers 

In this section we compare COMP-AODE* with previous credal classifiers. A
well-known credal classifier is the naive credal classifier (NCC) (Corani and Zaffalon,
2008b), which is an extension of naive Bayes to imprecise probability. We have run
NCC on the same collection of data sets following the experimental setup of Section 3;
under both $u_{65}$ and $u_{80}$, the utility produced by COMP-AODE* is significantly
higher (p < 0.01) than that produced by NCC. Thus, COMP-AODE* outperforms NCC.

Figure 7: Comparison between credal classifiers by means of the Friedman test: the
boldfaced points show the average ranks; a lower rank implies better performance. The
bars display the critical distance, computed with 95% confidence: the performances of
two classifiers are significantly different if their bars do not overlap.

However, over time algorithms more sophisticated than NCC have been developed,
such as:

• credal model averaging (CMA) (Corani and Zaffalon, 2008a), namely a generalization
of BMA (in the same spirit as BMA-AODE) for naive Bayes classifiers;

• credal decision tree (CDT) (Abellán and Moral, 2005), namely an extension of
classification trees to imprecise probability.

We then compare CDT, CMA and COMP-AODE* via the Friedman test; this is the
approach recommended by (Demšar, 2006) for comparing multiple classifiers on multiple
data sets. First, the procedure ranks on each data set the classifiers according to the
utility they generate; then, it tests the null hypothesis of all classifiers having the same
average rank across the data sets. If the null hypothesis is rejected, a post-hoc test is
adopted to identify the significant differences among classifiers. Adopting a 95%
confidence, no significant difference is detected among classifiers; the result is the same
under both utilities. However, under both utilities COMP-AODE* has the best average
rank, as shown in Figure 7. Lowering the confidence to 90%, two significant
differences are found: a) COMP-AODE* produces significantly higher utility than CMA
under $u_{65}$ and b) COMP-AODE* produces significantly higher utility than CDT
under $u_{80}$. These results, though not completely conclusive, suggest that COMP-AODE*
compares favorably to previous credal classifiers.
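The ranking step of the Friedman procedure described above can be sketched as follows. This is an illustrative computation only: the utilities are made-up numbers, the function names are hypothetical, and the sketch stops at the average ranks and the Friedman chi-square statistic, leaving out the p-value and the post-hoc test.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman(utilities):
    """Average ranks (rank 1 = highest utility) and Friedman statistic.

    utilities[i][j] is the utility of classifier j on data set i.
    """
    n_datasets = len(utilities)
    k = len(utilities[0])
    rank_sums = [0.0] * k
    for row in utilities:
        # Negate so that the highest utility receives rank 1.
        for j, r in enumerate(average_ranks([-u for u in row])):
            rank_sums[j] += r
    mean_ranks = [s / n_datasets for s in rank_sums]
    statistic = (12 * n_datasets / (k * (k + 1))
                 * (sum(r ** 2 for r in mean_ranks) - k * (k + 1) ** 2 / 4))
    return mean_ranks, statistic


# Made-up utilities of three classifiers (columns) on four data sets (rows).
utilities = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.75],
    [0.70, 0.90, 0.60],
    [0.95, 0.90, 0.80],
]
mean_ranks, statistic = friedman(utilities)
```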

3.5. Some comments on credal classification vs reject option 



Determinate classifiers can be equipped with a reject option (Herbei and Wegkamp,
2006), thus refusing to classify an instance if the posterior probability of the most
probable class is less than a threshold. For the sake of simplicity we consider a case
with two classes only; to formally introduce the reject option, it is necessary to set a
cost $d$ ($0 < d < 1/2$), which is incurred when rejecting an instance. A cost of 0, 1
and $d$ is therefore incurred when, respectively, correctly classifying, wrongly
classifying and rejecting an instance. Under 0-1 loss, the expected cost for classifying
an instance corresponds to the probability of misclassification; it is thus $1 - p^*$,
where $p^*$ denotes the posterior probability of the most probable class. The optimal
behavior is thus to reject the classification whenever the expected classification cost
is higher than the rejection cost, namely when $(1 - p^*) > d$; this is equivalent to
rejecting the instance whenever $p^* < 1 - d$, where $(1 - d)$ constitutes the rejection
threshold.
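The rejection rule can be written down directly. The sketch below is a minimal illustration for the two-class case discussed above; the function names and the example posteriors are assumptions made for the example, and `None` is used here to denote a rejected instance.

```python
def reject_option_decision(posterior, d):
    """Return the most probable class, or None when the instance is rejected.

    posterior maps each class to its posterior probability;
    d is the rejection cost, with 0 < d < 1/2.
    """
    if not 0.0 < d < 0.5:
        raise ValueError("the rejection cost d must satisfy 0 < d < 1/2")
    best = max(posterior, key=posterior.get)
    p_star = posterior[best]
    # Reject when the expected misclassification cost 1 - p* exceeds d.
    return None if p_star < 1.0 - d else best


def expected_cost(posterior, d):
    """Expected cost of the optimal action: classify (1 - p*) or reject (d)."""
    p_star = max(posterior.values())
    return min(1.0 - p_star, d)


# Made-up posteriors with rejection cost d = 0.2 (threshold 1 - d = 0.8).
uncertain = {"healthy": 0.7, "diseased": 0.3}
confident = {"healthy": 0.9, "diseased": 0.1}

decision_uncertain = reject_option_decision(uncertain, d=0.2)
decision_confident = reject_option_decision(confident, d=0.2)
```

With $d = 0.2$ the first instance is rejected because $p^* = 0.7$ falls below the threshold 0.8, while the second one is classified.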

The behavior induced by the reject option is quite different from that of a credal
classifier, as we show in the following example. On a very large data set the posterior
probability of the classes is only weakly sensitive to the choice of the prior, because of
the large amount of data available for learning; in this condition, instances are rarely
prior-dependent and therefore a credal classifier will mostly return a single class. On
the other hand, the determinate classifier with reject option (RO in the following)
rejects all the instances for which $p^* < 1 - d$; if $d$ is small, there can be even a
high number of rejected instances. The difference between these behaviors is due to the
credal classifier being unaware of the cost $d$ associated with rejecting an instance,
which is instead driving the behavior of RO. To rigorously compare RO against a credal
classifier, it is thus necessary to make the credal classifier aware of the cost $d$.
Recalling that the credal classifier already returns both classes on the instances which
are prior-dependent, this will change the behavior of the credal classifier only on the
instances which are not prior-dependent. In particular, the credal classifier should
reject all the instances for which $\underline{p}^* < 1 - d$, where $\underline{p}^*$ is
the lower probability of the most probable class; the instances rejected by means of this
criterion will thus be a superset of those rejected by RO. Therefore, the credal
classifier will reject the instances which are prior-dependent and those for which
$\underline{p}^* < 1 - d$. Finally, the cost generated by the credal classifier
should be compared with that generated by the RO. In the case with more than 2
classes the analysis might become slightly more complicated than what is discussed here;
however, we leave the analysis of credal classifiers with reject option as a topic for
future research. Note also that this kind of experiment will require the computation of
upper and lower posterior probabilities of the classes, which is not always trivial with
credal classifiers.
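The superset property mentioned above can be checked on a toy example. The sketch below is purely illustrative, since credal classifiers with reject option are left to future research in the text: it assumes that the lower probability of the most probable class and a prior-dependence flag are already available for each instance, and that the prior of the determinate classifier belongs to the credal set, so that the lower probability never exceeds the precise one. All names and numbers are hypothetical.

```python
def ro_rejects(p_star, d):
    """Determinate classifier with reject option: reject when p* < 1 - d."""
    return p_star < 1.0 - d


def credal_rejects(lower_p_star, prior_dependent, d):
    """Cost-aware credal classifier: reject prior-dependent instances and
    those whose lower probability of the most probable class is below 1 - d."""
    return prior_dependent or lower_p_star < 1.0 - d


# Made-up instances: (precise p*, lower p*, prior-dependent flag),
# with lower p* <= precise p* on every instance.
instances = [
    (0.95, 0.90, False),
    (0.85, 0.75, False),
    (0.70, 0.60, False),
    (0.60, 0.45, True),
]
d = 0.2

rejected_by_ro = {i for i, (p, _, _) in enumerate(instances) if ro_rejects(p, d)}
rejected_by_credal = {
    i for i, (_, lower, flag) in enumerate(instances) if credal_rejects(lower, flag, d)
}
```

In this example the second instance is rejected by the credal classifier only, because its lower probability falls below the threshold while its precise probability does not.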

4. Conclusions 

Applying Bayesian Model Averaging over SPODEs actually worsens the classification
performance compared to the standard AODE. Instead, the COMP-AODE classifier
proposed here, which applies the compression-based approach over SPODEs,
obtains overall slightly better classification performance than AODE; our results thus
broaden the scope of (Boullé, 2007), in which the compression-based approach was
applied over an ensemble of naive Bayes classifiers. The two credal classifiers
BMA-AODE* and COMP-AODE* extend respectively BMA-AODE and COMP-AODE to
imprecise probability, replacing the uniform prior over the SPODEs by a credal set.
Both credal classifiers automatically identify the prior-dependent instances, and cope
reliably with them by returning a small-sized but highly accurate set of classes. On
the prior-dependent instances both BMA-AODE and COMP-AODE undergo a severe
drop of accuracy. Both BMA-AODE* and COMP-AODE* provide overall higher
performance than their determinate counterparts as measured by the utility-based
measures, which to our knowledge constitute the state of the art for comparing
determinate and credal classifiers. According to the same metrics, COMP-AODE* shows
better performance than previous credal classifiers.

Appendix A. Mapping linear-fractional programs to linear programs by the
Charnes-Cooper transformation

In this appendix, we adapt the classical Charnes-Cooper transformation to the
particular linear-fractional program to be solved to test dominance for the BMA-AODE*,
as described in Section 2.3. Let us write the optimization variables as $x_j := P(s_j)$
and the coefficients as:

$$\gamma_j := P(c'|\mathbf{a}, s_j) L_j, \qquad \delta_j := P(c''|\mathbf{a}, s_j) L_j, \qquad (A.1)$$

with $j = 1, \ldots, k$. The objective function rewrites therefore as:

$$\frac{\sum_j \gamma_j x_j}{\sum_j \delta_j x_j}. \qquad (A.2)$$

Let us change the variables as follows:

$$y_j := \frac{x_j}{\sum_i \delta_i x_i}, \qquad (A.3)$$

and introduce the auxiliary variable

$$t := \frac{1}{\sum_j \delta_j x_j}. \qquad (A.4)$$

After this non-linear transformation, the objective function takes a linear form:

$$\sum_j \gamma_j y_j, \qquad (A.5)$$

while each linear constraint $x_j \geq \epsilon$ rewrites as $y_j \geq \epsilon t$, thus
being still linear. Similarly, the normalization rewrites as:

$$\sum_j y_j = t.$$

Moreover, by the definition of $y_j$, the new variables satisfy
$\sum_j \delta_j y_j = 1$.
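The change of variables can be checked numerically on a toy instance. The sketch below is not the linear programming solver used for BMA-AODE*: the coefficients are made-up positive numbers standing in for $\gamma_j$ and $\delta_j$, all function names are hypothetical, and the minimum is found by enumerating the vertices of the feasible region $\{x : x_j \geq \epsilon, \sum_j x_j = 1\}$, relying on the fact that a linear-fractional function with positive denominator attains its minimum over a polytope at a vertex.

```python
import random


def ratio(gamma, delta, x):
    """Linear-fractional objective: sum(gamma_j x_j) / sum(delta_j x_j)."""
    num = sum(g * xi for g, xi in zip(gamma, x))
    den = sum(dl * xi for dl, xi in zip(delta, x))
    return num / den


def charnes_cooper(delta, x):
    """Map x to the transformed variables (y, t), with y_j = t * x_j."""
    t = 1.0 / sum(dl * xi for dl, xi in zip(delta, x))
    return [t * xi for xi in x], t


def polytope_vertices(k, eps):
    """Vertices of {x : x_j >= eps, sum_j x_j = 1}: all mass on one coordinate."""
    vertices = []
    for j in range(k):
        vertex = [eps] * k
        vertex[j] = 1.0 - (k - 1) * eps
        vertices.append(vertex)
    return vertices


def random_feasible_point(k, eps, rng):
    """Random point of the feasible region (requires k * eps < 1)."""
    w = [rng.random() for _ in range(k)]
    total = sum(w)
    return [eps + (1.0 - k * eps) * wi / total for wi in w]


gamma = [0.2, 0.5, 0.3]   # made-up numerator coefficients
delta = [0.4, 0.1, 0.5]   # made-up denominator coefficients
eps = 0.01

x = [0.2, 0.3, 0.5]
y, t = charnes_cooper(delta, x)
linear_objective = sum(g * yj for g, yj in zip(gamma, y))
minimum = min(ratio(gamma, delta, v) for v in polytope_vertices(3, eps))
```

For any feasible $x$, the linear objective evaluated at the transformed point equals the original ratio, and the transformed point satisfies $\sum_j \delta_j y_j = 1$, $\sum_j y_j = t$ and $y_j \geq \epsilon t$.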



We have therefore mapped the original problem into a standard linear program and the
solutions of the two problems are known to coincide (Bajalinov, 2003, Chap. 3). Note
that the transformation only increases by one the number of constraints.

Appendix B. Data sets list



Table B.2: List of the 40 data sets used for experiments.

dataset           n      k    classes    dataset            n       k    classes

labor             57     11   2          ecoli              336     6    8
white_clover      63     6    4          liver_disorders    345     1    2
postoperative     90     8    3          ionosphere         351     33   2
zoo               101    16   7          monks3             554     6    2
lymph             148    18   4          monks1             556     6    2
iris              150    4    3          monks2             601     6    2
tae               151    2    3          credit_a           690     15   2
grub_damage       155    6    4          breast_w           699     9    2
hepatitis         155    16   2          diabetes           768     6    2
hayes_roth        160    3    3          anneal             898     31   6
wine              178    13   3          credit_g           1000    15   2
sonar             208    21   2          cmc                1473    9    3
glass             214    7    7          yeast              1484    7    10
heart_h           294    9    2          segment            2310    18   7
heart_c           303    11   2          kr_vs_kp           3196    36   2
haberman          306    2    2          hypothyroid        3772    25   4
solarflare_C      323    10   3          waveform           5000    19   3
solarflare_M      323    10   4          page_blocks        5473    10   5
solarflare_X      323    10   2          pendigits          10992   16   10
ecoli             336    6    8          nursery            12960   8    5